Improve megatron dataset preprocessing script and update docs #918

kevalmorabia97 wants to merge 1 commit into main.
Conversation
📝 Walkthrough

The PR refactors the Megatron preprocessing API from a single

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main     #918   +/-   ##
==========================================
- Coverage   73.11%   73.10%   -0.02%
==========================================
  Files         205      205
  Lines       22281    22281
==========================================
- Hits        16291    16288       -3
- Misses       5990     5993       +3
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)
172-179: ⚠️ Potential issue | 🟠 Major

`UnboundLocalError` on `i` when processing an empty JSONL file.

`i` is assigned only inside the `enumerate` loop at line 172. If `encoded_docs` yields no items (empty file), `i` is never defined, and the new `force_print` call on line 179 raises `UnboundLocalError`.

🐛 Proposed fix
```diff
- total_doc_len, total_enc_len, final_enc_len = 0, 0, 0
+ total_doc_len, total_enc_len, final_enc_len, i = 0, 0, 0, 0
  for i, (doc, sentence_lens, (doc_len, enc_len)) in enumerate(encoded_docs, start=1):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py` around lines 172 - 179, The loop variable i is undefined when encoded_docs is empty, causing UnboundLocalError at the final self._print_processing_stats(i, ...) call; fix by initializing i (e.g., i = 0) before the for i, (doc, sentence_lens, (doc_len, enc_len)) in enumerate(...) loop or by using a safe fallback when calling self._print_processing_stats (e.g., pass i if defined else 0); update the code around the enumerate block in megatron_preprocess_data.py so that i is always defined before the final force_print call.
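The failure mode above can be reproduced with a minimal sketch; the function and variable names here are hypothetical stand-ins for the real encoder loop, not the actual implementation:

```python
def process(encoded_docs):
    # Stand-in for the loop in megatron_preprocess_data.py: initializing
    # i before the loop keeps the final stats call safe even when
    # encoded_docs yields no items.
    i = 0
    total_enc_len = 0
    for i, enc_len in enumerate(encoded_docs, start=1):
        total_enc_len += enc_len
    # Without `i = 0` above, this line raises UnboundLocalError
    # when encoded_docs is empty.
    return i, total_enc_len

print(process([]))      # empty input: (0, 0)
print(process([3, 5]))  # two documents: (2, 8)
```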
🧹 Nitpick comments (2)
modelopt/torch/prune/plugins/mcore_minitron.py (1)
320-320: LGTM — consider opening a tracking issue for this rename.

The TODO comment clearly documents the pending rename of `hybrid_override_pattern` → `hybrid_layer_pattern` once Megatron-LM#3377 merges. No logic or behavior change.

Would you like me to draft a tracking issue for this rename so it doesn't get lost?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/prune/plugins/mcore_minitron.py` at line 320, The comment TODO notes that the name hybrid_override_pattern should be renamed to hybrid_layer_pattern after Megatron-LM#3377; create a tracking issue in your repo (or project board) documenting this rename, include the file and symbol names (hybrid_override_pattern in modelopt/torch/prune/plugins/mcore_minitron.py), mention the PR that triggers it (NVIDIA/Megatron-LM#3377), and list required follow-ups (update the TODO comment, rename the symbol across codebase, adjust any docs/tests that reference hybrid_override_pattern) so the change won't be lost.

modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)
271-271: Mutable default argument for `json_keys`.

`json_keys: list[str] = ["text"]` shares a single list object across all callers that omit the argument. While the list is currently only read (not mutated) inside the function, this is a Python antipattern that could cause subtle bugs if the implementation changes.

♻️ Proposed fix
```diff
- json_keys: list[str] = ["text"],
+ json_keys: list[str] | None = None,
```

Then at the top of the function body:

```python
if json_keys is None:
    json_keys = ["text"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py` at line 271, The parameter json_keys is defined with a mutable default list (`json_keys: list[str] = ["text"]`), so change the function signature to use None as the default (e.g., `json_keys: list[str] | None = None`) and add a guard at the top of the function (in modelopt/torch/utils/plugins/megatron_preprocess_data.py) that sets `json_keys = ["text"]` when `json_keys is None`; reference the parameter name `json_keys` to locate and update the code and ensure any type hints are adjusted accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/megatron_bridge/README.md`:
- Around line 103-111: The bash example uses inline comments after a
backslash-continued line (e.g., the lines containing `--json_keys text \ #
change to your JSON key if needed` and `--tokenizer Qwen/Qwen3-0.6B \ # fyi,
all models...`), which breaks line continuation; remove those trailing inline
comments or move them to their own lines above the command, or place comments
only on the final non-continued line, and ensure every `\` is the last character
on the line so the `python -m
modelopt.torch.utils.plugins.megatron_preprocess_data` command and flags
(`--jsonl_paths`, `--json_keys`, `--tokenizer`, `--output_dir`, `--workers`,
`--max_sequence_length`) are valid when copy-pasted.
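The continuation rule can be demonstrated with a trivial command; the `echo` here is only an illustration, not the actual preprocessing invocation:

```shell
# Broken: a trailing comment after the backslash means `\` is no longer
# the last character on the line, so the continuation is cut short:
#   echo foo \  # this comment breaks the continuation
#       bar
#
# Correct: comments go on their own lines, and every `\` ends its line.
# change "foo" to your value if needed
result=$(echo foo \
    bar)
echo "$result"  # foo bar
```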
In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py`:
- Around line 301-315: Move the creation of the output directory to before
calling _download_hf_dataset so the downloader can write files; specifically,
call Path(output_dir).mkdir(parents=True, exist_ok=True) prior to invoking
_download_hf_dataset(dataset, output_dir, json_keys, subset=subset, split=split)
in the code block that contains the dataset/download logic, leaving the rest of
the file discovery logic for input_dir/jsonl_paths unchanged.
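The ordering fix amounts to creating the directory before any helper tries to write into it; a sketch with a hypothetical `download_stub` standing in for `_download_hf_dataset`:

```python
from pathlib import Path
import tempfile

def download_stub(output_dir: str) -> Path:
    # Hypothetical stand-in for _download_hf_dataset: writes a file
    # directly into output_dir and fails if the directory is missing.
    target = Path(output_dir) / "dataset.jsonl"
    target.write_text('{"text": "hello"}\n')
    return target

root = Path(tempfile.mkdtemp()) / "processed"
# Create the output directory BEFORE invoking the downloader;
# without this, write_text raises FileNotFoundError.
root.mkdir(parents=True, exist_ok=True)
written = download_stub(str(root))
print(written.exists())  # True
```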
- Around line 229-255: _download_hf_dataset currently builds jsonl_file_path
using ds_path which may contain slashes and never ensures parent directories
exist, causing FileNotFoundError when ds.to_json is called; fix by sanitizing or
using only the dataset name (not the HF org prefix) when composing
jsonl_file_path and ensure the parent directory is created before writing:
derive a safe filename from ds_path (or replace '/' with '_' or use
Path(ds_path).name), compute parent = Path(output_dir)/"raw"/<safe_name> and
call parent.mkdir(parents=True, exist_ok=True) before checking/creating
jsonl_file_path and before calling ds.to_json; also ensure output_dir is created
earlier (or rely on the same mkdir(parents=True)) so writes never fail.
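The sanitization suggested above can be sketched as follows; the helper name `safe_jsonl_path` and the dataset ID are illustrative, not taken from the codebase:

```python
from pathlib import Path
import tempfile

def safe_jsonl_path(output_dir: str, ds_path: str) -> Path:
    # HF dataset IDs like "allenai/c4" contain a slash; replacing "/"
    # with "_" (or using Path(ds_path).name) keeps the filename flat.
    safe_name = ds_path.replace("/", "_")
    parent = Path(output_dir) / "raw"
    # Ensure the parent directory exists before any to_json-style write.
    parent.mkdir(parents=True, exist_ok=True)
    return parent / f"{safe_name}.jsonl"

out = tempfile.mkdtemp()
p = safe_jsonl_path(out, "allenai/c4")
print(p.name)             # allenai_c4.jsonl
print(p.parent.is_dir())  # True
```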
In `@tests/_test_utils/torch/megatron/utils.py`:
- Line 23: The shared utils module uses skip_if_no_megatron(te_required=False)
which allows import to run TE-dependent megatron.core model imports (e.g.,
GPTModel, MambaModel, GPTInferenceWrapper) and causes ImportError when TE is
missing; update the guard so the module is only imported when TE is available by
changing the skip invocation to skip_if_no_megatron(te_required=True) (or, if
utility functions truly don't need TE, move the skip_if_no_megatron call to
after the TE-free helper definitions and keep TE imports behind a separate
guarded block); ensure initialize_for_megatron and any TE-dependent symbols are
only loaded under the TE-required guard to keep imports consistent with
tests/models.py.
In `@tests/unit/torch/utils/test_megatron_preprocess_data.py`:
- Around line 98-121: The test test_megatron_preprocess_data_with_hf_dataset
makes live network calls and also writes JSON into a non-existent raw/
subdirectory in _download_hf_dataset; update the test to skip in offline CI by
adding a guard (e.g., pytest.mark.skipif or an env-check like
SKIP_NETWORK_TESTS) before running megatron_preprocess_data, and in
_download_hf_dataset ensure the target directory (the constructed raw/ path) is
created (os.makedirs(..., exist_ok=True)) before calling ds.to_json(); reference
the test function name test_megatron_preprocess_data_with_hf_dataset, the helper
_download_hf_dataset, and the call sites around load_dataset and ds.to_json when
implementing these changes.
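A sketch of the env-based skip guard described above; the `SKIP_NETWORK_TESTS` variable name comes from the comment, while the marker name and the placeholder test body are assumptions:

```python
import os
import pytest

# Pattern suggested in the review: gate network-dependent tests behind
# an environment variable so offline CI runs can opt out.
requires_network = pytest.mark.skipif(
    os.environ.get("SKIP_NETWORK_TESTS") == "1",
    reason="network access disabled in this CI environment",
)

@requires_network
def test_megatron_preprocess_data_with_hf_dataset(tmp_path):
    # Placeholder body: the real test calls load_dataset and
    # megatron_preprocess_data, both of which need network access.
    assert True
```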
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- examples/megatron_bridge/README.md
- examples/megatron_bridge/distill.py
- examples/nemo_run/common/process_climbmix.py
- modelopt/torch/prune/plugins/mcore_minitron.py
- modelopt/torch/utils/plugins/megatron_preprocess_data.py
- tests/_test_utils/torch/megatron/utils.py
- tests/unit/torch/utils/test_megatron_preprocess_data.py
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
What does this PR do?
Improve megatron dataset preprocessing script and update docs
Usage
Testing
Summary by CodeRabbit

- Documentation
- New Features
- Configuration Updates